##Data Analysis #2
This document contains an analysis that aims to determine the consequences and effectiveness of harvesting abalones based on age predictions derived from the abalone’s own physical attributes. More specifically, using an abalones volume as a decision criteria is explored below. There are several options as to exactly how an abalone’s volume should be considered and the exact harvesting strategy will depend on how often you are comfortable mistaking an infant abalone for an adult.
#### Section 1: (5 points) ####
(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.
(1)(b) Tranform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.
(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).
##
## Bartlett test of homogeneity of variances
##
## data: RATIO by CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531
##
## Bartlett test of homogeneity of variances
##
## data: L_RATIO by CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?
You can see a difference in variance when comparing the histograms of RATIO between each classes. When comparing the histograms of log-RATIO there seems to be more consistent variance behavior.
This observation is confirmed with Barlett’s test. Non-log RATIO variances by CLASS have a p-value low enough to reject the homogeneous variance hypothesis for reasonable confidence levels (i.e. 99% confident). On the other hand, the p-value of the log-RATIO variances by CLASS do not have a very small p-value (only 0.52) which suggest we should continue with the hypothesis that log-RATIO values have equal variances across different age classes.
#### Section 2 (10 points) ####
(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).
## Df Sum Sq Mean Sq F value Pr(>F)
## CLASS 4 1.055 0.26384 38.524 < 2e-16 ***
## SEX 2 0.091 0.04569 6.671 0.00132 **
## Residuals 1029 7.047 0.00685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## CLASS 4 1.055 0.26384 38.370 < 2e-16 ***
## SEX 2 0.091 0.04569 6.644 0.00136 **
## CLASS:SEX 8 0.027 0.00334 0.485 0.86709
## Residuals 1021 7.021 0.00688
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?
The aov test above are checking whether or not the variance in CLASS or SEX can explain the variance in L_RATIO (which is a measurement of meat per cubic centimeter of abalone volume). The small p-values for both CLASS and SEX suggest that L_RATIO is effected by changes in both CLASS and SEX, indicating we may be able to use CLASS and SEX as predictors for a abalone meat density.
(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).
Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?
When you compare two age classes, the difference in means seems to grow as the age classes get further apart. For example, the age classes with the largest difference in L_RATIO means are the A1 and A5, followed up by A5 and A2). Age classes closer together (i.e. A1/A2, A2/A3, A3/A4, and A4/A5) show mean differences that are closer to 0 than all other age class comparisons.
The second family-wise chart is comparing the L_RATIO of the difference sexes. The difference in means between M and F abalones is close to zero, as indicated by its center mark being near the dotted line. This lends to the suggestion that we might want to combine the those two geneders into a single class.
#### Section 3: (10 points) ####
(3)(a1) Here, we will combine “M” and “F” into a new level, “ADULT”. The code for doing this is given to you. For (3)(a1), all you need to do is execute the code as given.
##
## ADULT INFANT
## 707 329
(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.
The adult distribution is more normal than the infants’. The infant abalones have a right skew, which makes sense given that they are younger and thus more concentrated near the smaller volumes. The box-plots shows that the IRQ of the two types (ADULT and INFANT) do not overlap. This separation may allow us to use volume to separate the two groups with a useful degree of certainty. Where the two groups’ IQRs almost overlap (the 250 cc volume mark) we will may want investigate to see if there is another way to separate the two groups.
Answer: (Enter your answer here.)
(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.
Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?
Considering the CLASS charts, when we look at L_VOLUME and L_SHUCK the A1 age class seems to get pushed away from the mass of other values – this could help us during classification. Because the other classes are grouped together this suggests that abalone growth slows after reaching the more advanced age classes. This also suggest we be able to predict the A1 class more accurately than the other juvenile classes.
The log transform for TYPE on the other hand is not so helpful. The regression does get tighter, but the two gender types become much more intermingled when we apply the log transform to VOLUME and SHUCK.
#### Section 4: (5 points) ####
(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. You are given code in (4)(a1) to reclassify the infants in classes A4 and A5 as ADULTS.
##
## ADULT INFANT
## 747 289
(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.
##
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62316 -0.12500 0.00037 0.12891 0.71315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.83382 0.05001 -36.672 < 2e-16 ***
## L_VOLUME 0.99930 0.01026 97.377 < 2e-16 ***
## CLASSA2 -0.04146 0.02534 -1.636 0.102124
## CLASSA3 -0.10894 0.02872 -3.793 0.000158 ***
## CLASSA4 -0.17450 0.03237 -5.391 8.67e-08 ***
## CLASSA5 -0.26968 0.03254 -8.288 3.56e-16 ***
## TYPEINFANT -0.04857 0.01770 -2.744 0.006180 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.191 on 1029 degrees of freedom
## Multiple R-squared: 0.9504, Adjusted R-squared: 0.9501
## F-statistic: 3287 on 6 and 1029 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61633 -0.12430 0.00142 0.12785 0.74263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.902940 0.043334 -43.913 < 2e-16 ***
## L_VOLUME 1.006462 0.009956 101.087 < 2e-16 ***
## CLASSA2 -0.036633 0.025359 -1.445 0.14887
## CLASSA3 -0.090752 0.028035 -3.237 0.00125 **
## CLASSA4 -0.148445 0.031039 -4.782 1.98e-06 ***
## CLASSA5 -0.243563 0.031213 -7.803 1.48e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1917 on 1030 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9498
## F-statistic: 3918 on 5 and 1030 DF, p-value: < 2.2e-16
Essay Question: Interpret the trend in CLASS levelcoefficient estimates? (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).
The trend seems to be that as class increases (from 2 to 5) so too does the magnitude of the coefficient. This aligns with the previous charts as it indicates that as class goes up so too does SHUCK weight.
Overall, this trend is beneficial to us because our goal is to harvest abalones for their meat. The older the abalone gets (higher age class) the larger the meat weight (a.k.a SHUCK) gets. This gives us another reason to avoid harvesting infants: they are much less likely to have a high amount of meat compared to their adult counterparts.
Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.
TYPE is not an important predictor for meat content. While TYPE does contain some predictive power, its coefficient is nearly five times as small as the class A5’s predictive contribution. If we take TYPE out of our model and rerun it, the adjusted R-squared value (which accounts for adding and removing predictors) goes from 0.9501 to 0.9498 (meaning that TYPE only contributed to 0.0003 to improving the R value thus doesn’t help explain variance in the dependent variable).
The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).
#### Section 5: (5 points) ####
(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange(). Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).
##
## Bartlett test of homogeneity of variances
##
## data: model_residuals by CLASS
## Bartlett's K-squared = 3.6484, df = 4, p-value = 0.4557
##
## Bartlett test of homogeneity of variances
##
## data: model_residuals by TYPE
## Bartlett's K-squared = 0.53395, df = 1, p-value = 0.465
Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.
The mean residual for all CLASS and TYPE instances seem to be centered around zero and appear to be normally distributed. Confirming this is the Bartlett test which has a high p-value, leading us to support the hypothesis that the variance of residuals is homogeneous.
What this means is that for predictions of SHUCK we can expect the error contributed by an abalone’s VOLUME to be normally distributed across all predictions. This is useful because we can quantify how incorrect our prediction may be by using the normal curve. If we can quantify the error we can set limits that ensure we stay above a certain level of error most of the time.
Harvest Strategy:
There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.The Management needs to make a decision to implement 1 rule that meets the business goal.
The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.
#### Section 6: (5 points) ####
(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.
(6)(b) Our first “rule” will be protection of all infants. We want to find a volume cutoff that protects all infants, but gives us the largest possible harvest of adults. We can achieve this by using the volume of the largest infant as our cutoff. You are given code below to identify the largest infant VOLUME and to return the proportion of adults harvested by using this cutoff. You will need to modify this latter code to return the proportion of infants harvested using this cutoff. Remember that we will harvest any individual with VOLUME greater than our cutoff.
## [1] 526.6383
## [1] 0.2476573
## [1] 0
(6)(c) Our next approaches will look at what happens when we use the median infant and adult harvest VOLUMEs. Using the median VOLUMEs as our cutoffs will give us (roughly) 50% harvests. We need to identify the median volumes and calculate the resulting infant and adult harvest proportions for both.
## [1] 0.4982699
## [1] 0.9330656
## [1] 0.02422145
## [1] 0.4993307
(6)(d) Next, we will create a plot showing the infant conserved proportions (i.e. “not harvested,” the prop.infants vector) and the adult conserved proportions (i.e. prop.adults) as functions of volume.value. We will add vertical A-B lines and text annotations for the three (3) “rules” considered, thus far: “protect all infants,” “median infant” and “median adult.” Your plot will have two (2) curves - one (1) representing infant and one (1) representing adult proportions as functions of volume.value - and three (3) A-B lines representing the cutoffs determined in (6)(b) and (6)(c).
Essay Question: The two 50% “median” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?
If you were to make your cuttoff volume the median infant volume you would be discarding 12% of the adults you collect – you also would be incorrectly classifying 50% of the juveniles abalones as adults. On the other hand, if you were to harvest at the median adult volume you would preserve over 95% of the infant abalones. This comes at a price, though, as you would be discarding many more adult abalones collected and this could possibly make the harvest unprofitable.
There is likely a “middle ground” cuttoff that is between these two median values that optimizes the number infant abalones miss-classified as adults and adult abalones miss-classified as infants.
More harvest strategies:
This part will address the determination of a cutoff volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. In other words, we want to find the volume value such that the vertical distance between the infant curve and the adult curve is maximum. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.
#### Section 7: (10 points) ####
(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(a). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.
(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.
(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.
Include, side-by-side, the plot from (6)(d) but with a fourth vertical A-B line added. That line should intercept the x-axis at the “max difference” volume determined from the smoothed curve here.
(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions. We will actually calculate these proportions in two ways: first, by ‘indexing’ and returning the appropriate element of the (1 - prop.adults) and (1 - prop.infants) vectors, and second, by simply counting the number of adults and infants with VOLUME greater than the vlume threshold of interest.
Code for calculating the adult harvest proportion using both approaches is provided.
## [1] 0.7416332
## [1] 0.7416332
There are alternative ways to determine cutoffs. Two such cutoffs are described below.
#### Section 8: (10 points) ####
(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”
Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions. You may use either the ‘indexing’ or ‘count’ approach, or both.
(8)(b) Next, append one (1) more vertical A-B line to our (6)(d) graph. This time, showing the “zero A1 infants” cutoff from (8)(a). This graph should now have five (5) A-B lines: “protect all infants,” “median infant,” “median adult,” “max difference” and “zero A1 infants.”
#### Section 9: (5 points) ####
(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (6), (7) and (8) on this plot and label each.
(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.
## [1] 0.8666894
#### Section 10: (10 points) ####
(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1-prop.adults, 2) false positive rate (1-prop.infants), 3) harvest proportion of the total population
To calculate the total harvest proportions, you can use the ‘count’ approach, but ignoring TYPE; simply count the number of individuals (i.e. rows) with VOLUME greater than a given threshold and divide by the total number of individuals in our dataset.
| Name | Volume | TPR | FPR | TotalHarvest |
|---|---|---|---|---|
| Protect all infants | 526.638 | 0.248 | 0.000 | 0.179 |
| Median infants | 133.821 | 0.933 | 0.498 | 0.812 |
| Median adults | 384.558 | 0.499 | 0.024 | 0.367 |
| Max difference | 262.143 | 0.742 | 0.176 | 0.584 |
| No A1 infants | 206.786 | 0.826 | 0.287 | 0.676 |
Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the five cutoffs determined in this assignment.
The five cuttoffs represent different ways to optimization a harvest:
Largest infant (526 cc)
This cuttoff represents avoiding harvesting any infants. If your goal was to have minimal impact on an abalone population, this may be your best choice. Unfortunately, this cuttoff allows for the smallest amount of adults collected.
Median Adult (384 cc)
The median adult volume is high enough that, if it was your harvest cuttoff, you would start to collect non-trivial amounts of adults. Specifically at this cutoff you would be keeping roughly 50% of the adults you collect. This cuttoff is very close to the the largest infant number in terms of FPR (0% compared to 2.4%) – this means that you would be incorrectly classifying an infant as an adult only 2.4% of the time.
The major issue with this cutoff is that you are incorrectly classifying roughly 50% of the adult abalones as infants. This approach is beneficial for infants, but it means you are not doing too well in terms of harvest efficiency.
Max Difference (262 cc)
This cutoff is the point where the TRP and the FPR are the furthest apart (compared to any other cutoff). This volume threshold can be thought of as where you get the most “bang for your buck”. At this level you are correctly classifying adult abalones with ~75% accuracy and you only misclassify an infant abalone ~18% of the time.
Even though this threshold allows for ~75% accuracy in classifying adults, you will only be harvesting ~58% of the abalones you collect. This may seem low but it reflects that ~30% of the abalones in the survey dataset are infants:
| Precent of population | |
|---|---|
| ADULT | 72.1 |
| INFANT | 27.9 |
Median infant volume (133 cc)
This is the most aggressive cuttoff in terms of high TPR (93%). The cost of such a high adult harvest rate is a, possibly, dangerously high FPR (~50%) for harvesting infant abalones that should be returned to the water.
No A1 infants (206 cc)
This volume cuttoff is higher the max difference (it is 8% better in TPR than the max difference cuttoff) and is 11% worse at incorrectly harvesting (FPR) infant abalones. This trade-off yields a 9.2% incense in total abalones harvested.
This cutoff is interesting in that it completely protects the A1 class. This could allow those infants to mature during the periods between harvest and could offer an opportunity to increase harvest rate without having a disastrous effect on the underlying abalone population.
Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators How would you do so? Consider the following in your answer:
The data and analysis collected in this study likely could influence real world decisions that would have consequences effecting people’s livelihoods. Thus, as I am not a biologist or abalone expert, I am hesitant to make a strong recommendations and I would lean toward presenting options that are derived from the data. Each recommendation would come with context and be predicated on a certain value assumption (i.e. this option is best if you want to protect the infant abalones). It would be interesting to present this to someone who was an expert in abalone harvesting as it might lead to a fruitful conversation where they could learn from me and I could learn from them.
If it were necessary to proceed based on the current analysis, I would suggest using the max difference cutoff. The reason is that this cuttoff provides a productive harvest while limiting the damage to the infant abalone population. If I could confirm that not harvesting the A1 class would preserve the reproductive health of the underlying abalone population, I would suggest using the no-A1-infant cutoff.
A goal of future studies could be to increase attributes collected about the abalones being studied. If we could find more features that could help predict the class of an abalone, we could increase the total harvest amount. For example: location of abalone collected, species of abalone, temperature of water, and water algae content all could possibly be predictors for how old an abalone is. Another thing to try in future studies would be to use more advanced prediction techniques like decision trees.